Show&Tell: A Semi-Automated Image Annotation System
Abstract
A multimedia system for semi-automated image annotation, Show&Tell combines advances in speech recognition, natural language processing, and image understanding. Show&Tell differs from map annotation systems and has tremendous implications for situations where visual data must be coreferenced with text descriptions, such as medical image annotation and consumer photo annotation. Show&Tell takes advantage of advances in speech technology and natural language/image understanding research to make the preparation of image-related information more efficient. Specifically, we aim to identify relevant objects and regions in the image and to attach text descriptions to them. We use a combination of automated and semi-automated image understanding tools for object and region identification.
Image analysts can use Show&Tell in applications where text descriptions must be coreferenced with image areas, such as medical image annotation, National Aeronautics and Space Administration (NASA) space photo annotation, and even consumer photo annotation. Medical images suit our system well, since radiologists already use speech to dictate their findings and robust image understanding technology is available for several areas, such as chest and lung radiographs. In a joint effort with Kodak, we are adapting our system for consumer photo annotation. Since still cameras can be fitted with microphones, speech annotation of photos is now possible. Consumers will be able to easily create searchable digital photo libraries of their pictures, with a primary focus on pictures of people in various contexts.
Multimedia input analysis
Multimedia systems involving speech and deictic input can be classified into two major categories: multimedia input analysis and multimedia presentation. Our work focuses on the former. Our system differs from previous work on adding text annotations to pictorial data in the following ways.
Most systems assume that an underlying semantic representation of the pictorial data already exists. We don't. We use Clark's terminology.[1] The region to which the user points is the demonstratum, the descriptive part of the accompanying text is the descriptor, and the region to which the user intends to refer is the referent. Much of the recent work in multimedia input analysis concerns resolving ambiguous deictic references, that is, determining which of the possible referents that map to the same demonstratum the user intends.[2] Accompanying linguistic input, in the form of speech, is used for this purpose. Such systems assume that the type of deixis being used, known as demonstratio ad oculos, is distinguished by the fact …
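To make Clark's terms concrete, the following is a minimal Python sketch of deictic disambiguation. It is not Show&Tell's implementation: the `Region` type, the functions `candidate_referents` and `resolve_referent`, the bounding-box representation, and the matching heuristic (keep candidates whose label the descriptor mentions, then prefer the smallest region) are hypothetical assumptions. The regions stand in for the output of the image understanding tools, and a pointing gesture (the demonstratum) is reduced to a single pixel coordinate that may fall inside several nested regions.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), inclusive pixel bounds


@dataclass(frozen=True)
class Region:
    """A labeled image region (hypothetical stand-in for image-understanding output)."""
    name: str
    label: str
    box: Box

    def contains(self, x: int, y: int) -> bool:
        x0, y0, x1, y1 = self.box
        return x0 <= x <= x1 and y0 <= y <= y1

    def area(self) -> int:
        x0, y0, x1, y1 = self.box
        return (x1 - x0 + 1) * (y1 - y0 + 1)


def candidate_referents(regions: Sequence[Region], point: Tuple[int, int]) -> List[Region]:
    """Every region containing the pointed-at location, i.e. all possible referents."""
    x, y = point
    return [r for r in regions if r.contains(x, y)]


def resolve_referent(
    regions: Sequence[Region], point: Tuple[int, int], descriptor: str
) -> Optional[Region]:
    """Use the spoken descriptor to pick the intended referent among the candidates.

    Assumed heuristic: keep candidates whose label occurs as a word in the
    descriptor; if several remain, prefer the smallest (most specific) region.
    Returns None when the descriptor names none of the candidates.
    """
    words = set(re.findall(r"[a-z']+", descriptor.lower()))
    matches = [r for r in candidate_referents(regions, point) if r.label in words]
    if not matches:
        return None
    return min(matches, key=lambda r: r.area())


if __name__ == "__main__":
    # Hypothetical regions for a portrait: a face and a shirt nested inside a person.
    regions = [
        Region("p1", "person", (0, 0, 100, 200)),
        Region("f1", "face", (30, 10, 70, 50)),
        Region("s1", "shirt", (10, 60, 90, 140)),
    ]
    # Pointing at (50, 30) lands inside both p1 and f1; the descriptor decides.
    print(resolve_referent(regions, (50, 30), "This is John's face.").name)  # f1
    print(resolve_referent(regions, (50, 30), "This person is John.").name)  # p1
    print(resolve_referent(regions, (50, 100), "a red shirt").name)  # s1
```

Reducing the descriptor to a bag of words keeps the sketch short; a system like the one described would presumably rely on its natural language processing output instead of plain string matching, and the smallest-region preference is only one plausible tiebreak when nested candidates share a label.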
Similar articles
Show and Tell: Using Speech Input for Image Interpretation and Annotation
This research concerns the exploitation of linguistic context in vision. Linguistic context is qualitative in nature and is obtained dynamically. We view this as a new paradigm which is a golden mean between data driven object detection and site-model based vision. Our solution not only proposes new techniques for using qualitative contextual information, but also efficiently exploits existing ...
Fuzzy Neighbor Voting for Automatic Image Annotation
With the rapid development of digital images and the availability of imaging tools, massive amounts of images are created. Therefore, efficient management and suitable retrieval, especially by computers, is one of the most challenging fields in image processing. Automatic image annotation (AIA) refers to attaching words, keywords or comments to an image or to a selected part of it. In this paper,...
Image flip CAPTCHA
The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...
Tags Re-ranking Using Multi-level Features in Automatic Image Annotation
Automatic image annotation is a process in which computer systems automatically assign textual tags related to the visual content of a query image. Among the challenges in this field, inappropriate tags generated by users and images without any tags usually have a negative effect on the query's result. In this paper, a new method is presented for automatic image...
Journal title:
- IEEE MultiMedia
Volume 7, Issue
Pages: -
Publication date: 2000